feat(eval): Add skill evaluation framework #5
Conversation
- CI workflow runs on PRs touching plugin code
- Uses claude-code-action with --plugin-dir to load plugins
- Local testing via ./eval/run.sh --simple
- 5 test cases for skill auto-triggering
- JSON schema for structured output (future API key support)
Greptile Overview

Greptile Summary

Adds a comprehensive skill evaluation framework using claude-code-action.
Confidence Score: 4/5
Important Files Changed

File Analysis
Sequence Diagram

```mermaid
sequenceDiagram
    participant PR as Pull Request
    participant GH as GitHub Actions
    participant CCA as claude-code-action
    participant Claude as Claude Code
    participant Plugins as Plugin Skills
    participant YAML as Test Case YAML
    PR->>GH: Trigger on plugin/** or eval/** changes
    GH->>GH: Matrix strategy: spawn 5 parallel jobs
    loop For each test case
        GH->>CCA: Run with OAuth token
        CCA->>Claude: Initialize with --plugin-dir flags
        Claude->>Plugins: Load hope, product, wordsmith, founder, career
        CCA->>Claude: Send eval prompt
        Claude->>YAML: Read test case file
        YAML-->>Claude: Extract prompt + expected_behaviors
        Claude->>Claude: Process prompt (skills auto-trigger)
        Claude->>Plugins: Skills evaluate trigger conditions
        Plugins-->>Claude: Skill activates (or doesn't)
        Claude->>Claude: Self-evaluate response vs expected_behaviors
        Claude->>Claude: Generate verdict (PASS/PARTIAL/FAIL)
        Claude-->>CCA: Return result with verdict
        CCA-->>GH: Output result
        GH->>GH: Parse verdict with grep
        alt VERDICT: PASS
            GH->>GH: Print success ✓
        else VERDICT: PARTIAL
            GH->>GH: Print warning ⚠
        else VERDICT: FAIL or no verdict
            GH->>GH: Fail job with error
        end
    end
    GH->>PR: Report CI status
```
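The verdict-handling branch of the loop above can be sketched as a small shell function (a sketch only; `classify_verdict` is a hypothetical helper, not code from the workflow — PASS and PARTIAL keep the job green, while FAIL or a missing verdict fails it):

```shell
# Classify one test result by its VERDICT line, mirroring the alt branches
# in the sequence diagram.
classify_verdict() {
  local result="$1" test_name="$2"
  if echo "$result" | grep -q "VERDICT: PASS"; then
    echo "✓ $test_name: PASS"
  elif echo "$result" | grep -q "VERDICT: PARTIAL"; then
    echo "⚠ $test_name: PARTIAL"
  else
    echo "✗ $test_name: FAIL (or no verdict found)"
    return 1  # FAIL or missing verdict fails the job
  fi
}
```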
```sh
elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "⚠ $TEST_NAME: PARTIAL"
```
style: VERDICT: PARTIAL exits with success but may indicate issues. Consider if partial verdicts should fail CI to maintain quality gates.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: .github/workflows/eval.yml
Line: 72:73
How can I resolve this? If you propose a fix, please make it concise.
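One concise resolution (a sketch; the `STRICT_PARTIAL` variable is a hypothetical toggle, not an existing workflow input) is to keep PARTIAL as a warning by default but let a strict mode fail the job:

```shell
# Example values; in the workflow these come from the eval step's output.
RESULT="Skill activated but missed one behavior. VERDICT: PARTIAL"
TEST_NAME="hope-gate-completion"
STRICT_PARTIAL="${STRICT_PARTIAL:-false}"  # hypothetical strict-mode toggle

if echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "::warning::$TEST_NAME returned PARTIAL verdict"
  if [ "$STRICT_PARTIAL" = "true" ]; then
    exit 1  # strict mode: treat PARTIAL as failure
  fi
fi
```

This keeps the current lenient behavior while giving the team a one-variable quality gate.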
PR Review: Skill Evaluation Framework

Thank you for this well-structured addition! The evaluation framework is a solid approach to ensuring skill quality.

Strengths

Architecture & Design
Documentation
Code Quality

Issues & Recommendations

1. CRITICAL: Test Case Date Inconsistency
Location: All test YAML files (eval/cases/skill-triggers/*.yaml:26)
All test files show `created: 2024-12-15`, but the current date is 2025-12-16. This appears to be a typo.
Fix: Update to 2025-12-15 in all 5 test files.

2. CI Workflow: Missing Plugin Directory Check
Location: .github/workflows/eval.yml:56
The workflow references `--plugin-dir ./founder --plugin-dir ./career`, but there is no validation that the plugin directories are valid.
Recommendation: Add a validation step before running tests.

3. Shell Script: Partial Test Name Matching
Location: eval/run.sh:49
The documentation shows `./eval/run.sh --simple hope-gate`, but the actual test is `hope-gate-completion.yaml`. This will fail.
Fix: Either update the docs to use full names, or add fuzzy matching for partial test names.

4. Schema: reasoning Should Be Required
Location: eval/schema.json:45
The `reasoning` field is optional, but the CI prompt says "explain your reasoning briefly". This could cause validation failures.
Fix: Add `reasoning` to the required fields array.

5. Test Coverage: Missing Negative Cases
Location: eval/cases/skill-triggers/
All 5 test cases validate positive triggers. There are no tests for skills that should NOT trigger on certain prompts.
Recommendation: Add negative test cases to ensure skills do not over-trigger.

6. Test Coverage: Missing founder/career Tests
Location: .github/workflows/eval.yml:56
The CI workflow loads the founder and career plugins, but no test cases exist for them.
Fix: Either add test cases or remove them from the `--plugin-dir` flags.

7. Security: Potential Log Exposure
Location: .github/workflows/eval.yml:61-66
The full RESULT output is echoed to CI logs. If a Claude response contains sensitive repo info, it will be exposed.
Mitigation: Sanitize the output, or log only the verdict line.

8. Performance: Timeout May Be Too Short
Location: .github/workflows/eval.yml:57
`timeout_minutes: 5` might be tight for large repos or slow runners.
Recommendation: Increase to 10 minutes.

9. Documentation: Missing Failure Triage Guide
Location: docs/dev/evaluations.md:82-93
The Troubleshooting section does not explain how to debug failed CI tests locally.
Recommendation: Add a "Debugging Failed Tests" section with step-by-step instructions.

Summary

Code Quality: 4/5

Recommended Actions

Must Fix:
Should Fix:
Nice to Have:

Overall Assessment

This is a high-quality PR that introduces a well-thought-out evaluation system. The self-evaluation approach is clever and avoids external dependencies. The documentation is exemplary. With the fixes above (especially the date correction and schema update), this is ready to merge.

Verdict: APPROVE with minor changes requested
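The fuzzy-matching option from issue 3 could look roughly like this (`resolve_case` is a hypothetical helper; the `eval/cases/skill-triggers` path comes from the review above):

```shell
# Resolve a partial test name (e.g. "hope-gate") to exactly one case file.
resolve_case() {
  local query="$1" dir="${2:-eval/cases/skill-triggers}"
  local matches count
  matches=$(find "$dir" -maxdepth 1 -name "*${query}*.yaml" 2>/dev/null)
  count=$(printf '%s\n' "$matches" | grep -c .)
  if [ "$count" -eq 1 ]; then
    printf '%s\n' "$matches"          # unambiguous: print the full path
  elif [ "$count" -eq 0 ]; then
    echo "Error: no test case matches '$query'" >&2
    return 1
  else
    echo "Error: '$query' is ambiguous, matches:" >&2
    printf '%s\n' "$matches" >&2
    return 1
  fi
}
```

With this in `eval/run.sh`, both `hope-gate` and `hope-gate-completion` would resolve to the same file, and ambiguous prefixes fail loudly instead of silently picking one.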
- Remove `timeout_minutes` input (not valid for claude-code-action)
- Add `show_full_output: true` for better CI debugging
- Add session_id to debug output
- Improve error messages for execution_file issues
PR Review: Skill Evaluation Framework

Summary

This PR adds a comprehensive skill evaluation framework to test auto-triggering across the moo.md plugin ecosystem. The implementation is well-architected and follows the project's conventions.

Strengths

Architecture & Design
Code Quality
Conventions Adherence
Issues & Recommendations

1. Security: Secrets Exposure Risk
Severity: HIGH
The workflow uses `show_full_output: true`, which echoes Claude's full output to CI logs.
Recommendation:

```yaml
# Remove or set to false
show_full_output: false  # Only show summary
```

If full output is needed for debugging, add a comment explaining why and ensure skills never process sensitive data.

2. Error Handling: Missing Claude CLI Check
Severity: MEDIUM
The local test script (`eval/run.sh`) does not verify that the `claude` CLI is installed.
Recommendation (eval/run.sh:5-8):

```sh
# Add after SCRIPT_DIR definition
if ! command -v claude &> /dev/null; then
  echo 'Error: claude CLI not found. Install Claude Code first.'
  exit 1
fi
```

3. Bug: Test Case Schema Mismatch
Severity: MEDIUM
Test YAML files include fields that are not used anywhere; the CI workflow only reads `prompt` and `expected_behaviors`.
Options:
4. Robustness: Partial CI Failures Not Handled
Severity: MEDIUM
The CI exits 0 on PARTIAL verdicts.
Current (.github/workflows/eval.yml:94-95):

```sh
elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "⚠ $TEST_NAME: PARTIAL"  # Exit code 0
```

Recommendation:

```sh
elif echo "$RESULT" | grep -q "VERDICT: PARTIAL"; then
  echo "::warning::$TEST_NAME returned PARTIAL verdict"
  # Optionally: exit 1 to fail the build on partial
```

Consider adding a workflow input to toggle strict mode (fail on PARTIAL).

5. Performance: Serial Test Execution Locally
Severity: LOW
Recommendation:

6. Code Quality: Hardcoded Plugin List
Severity: LOW
Plugin directories are hardcoded in two places. If a new plugin is added, both must be updated.
Recommendation:

7. Documentation: Missing Failure Examples
Severity: LOW
The README shows successful output but not what failures look like, so developers won't know what to expect when a test fails.
Add to eval/README.md:

8. Best Practice: Verdict Pattern Too Loose
Severity: LOW
Verdict detection uses a plain `grep` over the whole output, which can match the verdict string anywhere.
Recommendation:

```sh
if tail -n 5 "$EXECUTION_FILE" | grep -q "^VERDICT: PASS$"; then
```

Minor Nits
Test Coverage Assessment

Current tests cover:

Recommendation: Add at least one test for …

Security Checklist

Final Verdict

Recommendation: Approve with changes
This is solid work that adds real value. The architecture is sound and the documentation is thorough. Address the HIGH severity issue (#1) before merging, and consider the MEDIUM severity items for follow-up PRs.

Must-Fix Before Merge
Should-Fix (Before or After Merge)
Nice-to-Have
5-8. Performance, maintenance, and UX improvements

Great job on this foundation! The eval framework will make maintaining plugin quality much easier.

Confidence: 85% - Thorough review of all files; tested the shell script logic mentally, but have not run the CI workflow end-to-end.
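Issue 6 (the hardcoded plugin list) could be addressed by deriving the `--plugin-dir` flags from the repository layout. In this sketch, `plugin.json` is an assumed manifest name, so adjust it to whatever marker file the repo actually uses:

```shell
# Build --plugin-dir flags from every top-level directory that contains a
# plugin manifest, instead of hardcoding hope/product/wordsmith/founder/career.
plugin_flags() {
  local root="${1:-.}" flags="" d
  for d in "$root"/*/; do
    # plugin.json is a hypothetical marker file; use the repo's real one.
    if [ -f "${d}plugin.json" ]; then
      flags="$flags --plugin-dir ${d%/}"
    fi
  done
  printf '%s\n' "${flags# }"
}
```

Both the CI workflow and `eval/run.sh` could then call the same helper, so adding a new plugin requires no workflow edits.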
Summary
- `claude-code-action` with `--plugin-dir` to load plugins from checked-out repo
- `./eval/run.sh --simple` for local testing

Test plan
- `./eval/run.sh --simple hope-gate-completion` works